Batch correction of two-panel CyTOF data

Christina Bligaard Pedersen

February 2, 2021

This vignette demonstrates batch correction of a CyTOF dataset in which samples were measured using two different panels. In addition to performing batch correction, we impute the non-overlapping markers, which allows a much more direct integration of these data.


This is data from a study of CLL patients and healthy donors at the Dana-Farber Cancer Institute (DFCI). Protein expression was quantified using two different panels of proteins with an overlap. The data generated with each panel was processed in eight batches. The data is B-cell depleted.


Pre-processing data

In this dataset, it seems reasonable to start by looking at the two panels.

Now, we have the panels - so let us extract the markers and identify the overlap.

 [1] "CD20"   "CD3"    "CD45RA" "CD5"    "CD19"   "CD14"   "CD33"   "CD4"   
 [9] "CD8"    "CD197"  "CD56"   "CD161"  "FoxP3"  "HLADR"  "XCL1"  

We observe that there are 15 overlapping markers in total. These cover many of the major cell types (e.g. CD3, CD4, and CD8 for T cells, CD56 for NK cells, and CD14 and CD33 for myeloid cell types).
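Identifying the overlap amounts to an ordered intersection of the two marker lists. The following is a minimal, language-agnostic sketch in Python rather than the code used to produce the output above; the shortened marker lists and the function name `overlapping_markers` are hypothetical, and the assignment of markers to panels is illustrative only.

```python
# Hypothetical, shortened marker lists. In the vignette the markers are read
# from the two panel files; which panel each marker belongs to is made up here.
panel1_markers = ["CD20", "CD3", "CD45RA", "CD16", "CD4", "CD8", "CD56"]
panel2_markers = ["CD3", "CD4", "CD8", "GranzymeA", "CD56", "CD20"]


def overlapping_markers(first, second):
    """Return the markers present in both panels, in the order of `first`."""
    second_set = set(second)  # set lookup keeps the scan linear
    return [marker for marker in first if marker in second_set]


overlap = overlapping_markers(panel1_markers, panel2_markers)
print(overlap)
print(len(overlap), "overlapping markers")
```

Keeping the order of the first panel (instead of using a plain set) makes the result reproducible and easy to compare against the panel file.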


The workflow presented in this vignette can be visualized with the following schematic. Datasets a and b are the datasets for the two panels used here.


We are now ready to load the CyTOF data. We convert it to a tibble format, which is easy to process. We use cofactor = 5 (default) for asinh-transformation.

Reading 136 files to a flowSet..
Extracting expression data..
Your flowset is now converted into a dataframe.
Transforming data using asinh with a cofactor of 5..
Done!
Reading 117 files to a flowSet..
Extracting expression data..
Your flowset is now converted into a dataframe.
Transforming data using asinh with a cofactor of 5..
Done!
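For readers unfamiliar with the transformation mentioned in the log, the sketch below shows the basic arithmetic of an asinh transformation with a cofactor of 5. It is an illustration in Python under the assumption that each raw intensity is divided by the cofactor before `asinh` is applied; the function name `asinh_transform` and the example values are hypothetical, and any additional steps the package performs during loading are not reproduced.

```python
import math


def asinh_transform(values, cofactor=5.0):
    """Apply asinh(value / cofactor) to every raw intensity."""
    if cofactor <= 0:
        raise ValueError("cofactor must be positive")
    return [math.asinh(value / cofactor) for value in values]


# Hypothetical raw intensities spanning several orders of magnitude.
raw = [0.0, 5.0, 50.0, 500.0]
transformed = asinh_transform(raw)

for before, after in zip(raw, transformed):
    print(f"{before:8.1f} -> {after:.3f}")
```

The transformation is close to linear near zero and close to logarithmic for large values, which is why it is commonly used to compress the dynamic range of CyTOF intensities.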


Processing data - batch correction

Panel 1 - batch correction

In this case, the dataset for each panel was, as mentioned, run in eight batches - this means that there are likely some batch effects to correct for within each panel as well. We take care of this first, before we start integrating across panels!


Let us have a quick look at some UMAPs to visualize the correction for each batch. We downsample so it is easier to see what is going on.

Now, let us view the expression distributions for all the markers in panel 1.

Finally, we will evaluate the EMD reduction for the batch correction of panel 1. To do this, we first need to perform a clustering of the corrected set, and we will transfer the labels to the uncorrected set for direct comparison.
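To make the idea of an EMD reduction concrete, here is a minimal sketch in Python. It assumes the reduction is defined as the relative decrease in the summed pairwise Earth Mover's Distance between batches, computed for one marker within one cluster. The helper names (`emd_1d`, `emd_reduction`) and the toy values are hypothetical, and the EMD here is computed directly on equal-sized samples rather than on binned distributions, so this is a simplification and not the package's implementation.

```python
from itertools import combinations


def emd_1d(a, b):
    """EMD between two equal-sized 1D samples.

    For equal-sized samples this is the mean absolute difference
    between the sorted values.
    """
    if len(a) != len(b) or not a:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def total_pairwise_emd(batches):
    """Sum the EMD over all pairs of batches (dict: batch name -> values)."""
    return sum(
        emd_1d(batches[i], batches[j]) for i, j in combinations(sorted(batches), 2)
    )


def emd_reduction(uncorrected, corrected):
    """Relative decrease in summed pairwise EMD after correction."""
    before = total_pairwise_emd(uncorrected)
    if before == 0:
        raise ValueError("no batch effect to reduce (EMD before is zero)")
    return (before - total_pairwise_emd(corrected)) / before


# Toy expression values for one marker in one cluster, two batches.
uncorrected = {"batch1": [0.0, 1.0, 2.0, 3.0], "batch2": [2.0, 3.0, 4.0, 5.0]}
corrected = {"batch1": [1.0, 2.0, 3.0, 4.0], "batch2": [1.5, 2.5, 3.5, 4.5]}

reduction = emd_reduction(uncorrected, corrected)
print(f"EMD reduction: {reduction:.2f}")
```

Because the same cluster labels are used for the corrected and uncorrected sets, the two totals compare like with like: any decrease reflects the batches moving closer together within a cell population, not a change in how cells were grouped.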


Panel 2 - batch correction

Now, it is time to do the same with the data from panel 2. First, we batch correct:


We look at the UMAPs.

We also take a look at the marker distributions for panel 2.

Lastly, we evaluate the EMD reduction for the batch correction of panel 2 in the same manner as presented above.


For both panels, we find an EMD reduction of 0.67 when considering the batch corrections separately.


Combined batch correction

Based on the marker distributions after correction and the UMAPs, it looks like the batch effects within the data for each panel have been minimized. Now we can focus on the integration of the two sets. The first step is to batch correct the two datasets against each other based on the overlapping markers.


Similarly to the corrections within each panel’s data, we can now look at the UMAPs and marker distributions before and after correction - and we can also calculate the EMD reduction.

And the EMD reduction:

For this correction, we obtain an EMD reduction of 0.85.


Imputing non-overlapping markers

Now that we have batch corrected the overlapping markers from the two panels, they are directly comparable. However, when limiting ourselves to the overlapping set, we also discard all information contained in the non-overlapping markers. In this case, panel 1 contains 21 markers not found in panel 2 - and panel 2 has 19 markers not found in panel 1. These markers include CD16, which is important for distinguishing NK cell and monocyte subsets, and Granzyme A, which is important for the deeper characterization of cytotoxic T cells and NK cells.

We want to include these markers in our dataset, but because the non-overlapping markers were only measured on roughly half of the cells, we have to use imputation to provide a value for the other panel’s cells.

We start by defining the sets of non-overlapping markers and then add the values for these to the batch corrected values for the overlapping markers for each panel.


We are now ready to impute the values of the markers unique to panel 2 for the panel 1 data - and vice versa.

Creating SOM grid..
Mapping data to SOM
[1] "Performing density draws for dataset1"
[1] "Performing density draws for dataset2"
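The log above indicates that cells are first mapped to a SOM and that values are then drawn per dataset. The sketch below illustrates the general idea in Python under simplifying assumptions: cluster labels are taken as given (in practice they would come from a SOM built on the overlapping markers), and a missing marker value is drawn at random from the observed values of cells from the other panel in the same cluster. Drawing directly from observed values stands in for the density draws mentioned in the log; the function name `impute_from_clusters` and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def impute_from_clusters(target_clusters, donor_clusters, donor_values, seed=0):
    """Impute one marker for target cells from donor cells in the same cluster.

    target_clusters: cluster label per cell lacking the marker
    donor_clusters:  cluster label per cell from the panel that measured it
    donor_values:    measured marker value per donor cell
    Returns one imputed value per target cell, or None when the cell's
    cluster contains no donor cells.
    """
    rng = random.Random(seed)  # fixed seed so the draws are reproducible

    # Pool the measured values by cluster.
    pool = defaultdict(list)
    for cluster, value in zip(donor_clusters, donor_values):
        pool[cluster].append(value)

    imputed = []
    for cluster in target_clusters:
        candidates = pool.get(cluster)
        imputed.append(rng.choice(candidates) if candidates else None)
    return imputed


# Toy example: one marker measured only on the donor panel.
donor_clusters = [1, 1, 2, 2]
donor_values = [0.1, 0.2, 3.0, 3.2]
target_clusters = [1, 2, 2, 3]

imputed = impute_from_clusters(target_clusters, donor_clusters, donor_values)
print(imputed)
```

Drawing values from cells in the same cluster, instead of assigning a cluster mean, preserves the spread of the marker within each population, which matters when the imputed markers are later used for visualization or clustering.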


Now, we are ready to use this combined dataset to answer the biological questions and make nice visualizations. Let us look at all of the markers after batch corrections and imputation.

This looks very nice in terms of obtaining comparable distributions between the cells originating from each panel.

Let us make a UMAP for the combined set - based on all 55 markers (15 overlapping + 21 unique to panel 1 + 19 unique to panel 2).

After running this process, it is possible to perform any further processing that is relevant for the dataset. In this case, we proceeded to cluster the data using SOM and ConsensusClusterPlus on the following 20 markers: CD3, CD45RA, CD14, CD45RO, CD152, CD33, CD4, CD8, CD197, CD56, FoxP3, CD25, CD1c, CD1d, CD11c, CD16, CD34, CD11b, FCeR1a, XCL1.

The clusters were annotated manually based on the protein expression levels. Let us load these labels and add them to the UMAP.


Finally, let us compare the cell type fractions between the data originally derived from each panel.

[1] 0.9982821
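A comparison of cell type fractions between panels can be sketched as follows. This is a minimal Python illustration that assumes agreement is summarized with a Pearson correlation between the per-panel fractions; the vignette text does not state which statistic the printed value represents, so that choice, the helper names, and the toy labels are all assumptions.

```python
import math
from collections import Counter


def cell_type_fractions(labels, cell_types):
    """Fraction of cells carrying each label, in the order of `cell_types`."""
    counts = Counter(labels)
    return [counts[cell_type] / len(labels) for cell_type in cell_types]


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Toy annotations for cells originating from each panel.
cell_types = ["T cells", "NK cells", "Myeloid"]
panel1_labels = ["T cells"] * 6 + ["NK cells"] * 2 + ["Myeloid"] * 2
panel2_labels = ["T cells"] * 11 + ["NK cells"] * 5 + ["Myeloid"] * 4

fractions1 = cell_type_fractions(panel1_labels, cell_types)
fractions2 = cell_type_fractions(panel2_labels, cell_types)
r = pearson(fractions1, fractions2)
print(f"Correlation of cell type fractions: {r:.3f}")
```

A value close to 1 would indicate that the two panels yield nearly the same cell type composition after correction and imputation.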